Record Matching to Improve Data Quality
Abstract
Data quality is defined in [TB98] as fitness for use, which implies that quality is relative to how the data is used. Data quality problems tend to fall into two categories: inconsistency among systems and inconsistency with reality. Format/syntax, semantic, and value inconsistencies are representative of inconsistency among systems, whereas incorrect and missing values are representative of inconsistency with reality. In this paper, we address the record matching problem, which is related to value inconsistencies and to incorrect or missing values. Inconsistencies in duplicated or partially overlapping information occur when changes in one system are not reflected in the others, for reasons such as bad design or lack of trust among systems. The record matching problem is the difficulty of identifying entities in different interoperating systems, which evolve independently over time, that refer to the same real-life entity. It is typical of multi-system organizations where data residing in diverse systems needs to be merged, either to assess financial risks or to cut costs associated with various projects. The methodology presented in this paper unifies a variety of techniques addressing the record matching problem, which we treat as a classification task. The techniques used are the
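To make the framing of record matching as a classification task concrete, here is a minimal Python sketch. It is not the paper's methodology, whose techniques are cut off in the abstract above. It assumes records are dictionaries of string fields, uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, and applies a hypothetical fixed threshold of 0.8 as the decision rule; the names `field_similarity`, `comparison_vector`, and `classify_pair` are illustrative only.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Similarity in [0, 1] between two field values; a missing value scores 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def comparison_vector(rec_a, rec_b, fields):
    """One similarity score per compared field (fields must be non-empty)."""
    return [field_similarity(rec_a.get(f), rec_b.get(f)) for f in fields]


def classify_pair(rec_a, rec_b, fields, threshold=0.8):
    """Label a record pair by its mean field similarity (assumed decision rule)."""
    vector = comparison_vector(rec_a, rec_b, fields)
    score = sum(vector) / len(vector)
    return "match" if score >= threshold else "non-match"


# Hypothetical records from two systems describing people.
a = {"name": "John Smith", "city": "Boston"}
b = {"name": "Jon Smith", "city": "Boston"}
c = {"name": "Mary Jones", "city": "Denver"}

print(classify_pair(a, b, ["name", "city"]))
print(classify_pair(a, c, ["name", "city"]))
```

A learned classifier would replace the fixed threshold with a decision rule trained on labeled comparison vectors; the threshold here only keeps the sketch self-contained.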
Related resources
Adaptive Approximate Record Matching
Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records that belong to the same entity. The aim of Approximate Record Matching is to find the multiple records that belong to one entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...
Comparison between historical population archives and decentralized databases
Differences between large-scale historical population archives and small decentralized databases can be used to improve data quality and record connectedness in both types of databases. A parser is developed to account for differences in syntax and data representation models. A matching procedure is described to discover records from different databases referring to the same historical event. T...
Automating the approximate record-matching process
Data quality has many dimensions, one of which is accuracy. Accuracy is usually compromised by errors accidentally or intentionally introduced in a database system. These errors result in inconsistent, incomplete, or erroneous data elements. For example, a small variation in the representation of a data object produces a unique instantiation of the object being represented. In order to improve ...
An Efficient Adaptive Boundary Matching Algorithm for Video Error Concealment
Sending compressed video data in error-prone environments (such as the Internet and wireless networks) can cause data degradation. Error concealment techniques try to conceal errors in the received data on the decoder side. In this paper, an adaptive boundary matching algorithm is presented for recovering the damaged motion vectors (MVs). This algorithm uses an outer boundary matching or directional tempo...
Conditional Dependencies: A Principled Approach to Improving Data Quality
Real-life data is often dirty and costs businesses worldwide billions of pounds each year. This paper presents a promising approach to improving data quality. It effectively detects and fixes inconsistencies in real-life data based on conditional dependencies, an extension of database dependencies by enforcing bindings of semantically related data values. It accurately identifies records fro...
Publication date: 2007